(54) Subband image encoding method

(57) A method of image encoding using subband decomposition followed by separate coding of the highest level lowpass image plus zerotree coding of the higher bands.
Description

BACKGROUND OF THE INVENTION 

The invention relates to video communication methods and devices, and, more particularly, to digital communication and storage systems using compressed video data.

Video communication (television, teleconferencing, and so forth) typically transmits a stream of video frames (pictures, images) along with audio over a transmission channel for real time viewing and listening or storage. However, transmission channels frequently add corrupting noise and have limited bandwidth. Consequently, digital video transmission with compression enjoys widespread use. In particular, various standards for compression of digital video have emerged and include H.261, MPEG-1, and MPEG-2, with more to follow, including H.263 and MPEG-4, which are in development. There are similar audio compression methods.

Tekalp, Digital Video Processing (Prentice Hall 1995), Clarke, Digital Compression of Still Images and Video (Academic Press 1995), and Schafer et al, Digital Video Coding Standards and Their Role in Video Communications, 83 Proc. IEEE 907 (1995), include summaries of various compression methods, including descriptions of the H.261, MPEG-1, and MPEG-2 standards plus the H.263 recommendations and indications of the desired functionalities of MPEG-4.

H.261 compression uses interframe prediction to reduce temporal redundancy and the discrete cosine transform (DCT) on a block level together with high spatial frequency cutoff to reduce spatial redundancy. H.261 is recommended for use with transmission rates in multiples of 64 Kbps (kilobits per second) to 2 Mbps (megabits per second).

The H.263 recommendation is analogous to H.261 but for bitrates of about 22 Kbps (twisted pair telephone wire compatible). It uses motion estimation at half-pixel accuracy, which eliminates the need for the loop filtering available in H.261. It also uses overlapped motion compensation to obtain a denser motion field (set of motion vectors) at the expense of more computation, and adaptive switching between motion compensation with 16 by 16 macroblocks and 8 by 8 blocks.

MPEG-1 and MPEG-2 also use temporal prediction followed by two-dimensional DCT transformation on a block level as in H.261, but they make further use of various combinations of motion-compensated prediction, interpolation, and intraframe coding. MPEG-1 aims at video CDs and works well at rates of about 1-1.5 Mbps for frames of about 360 pixels by 240 lines and 24-30 frames per second. MPEG-1 defines I, P, and B frames. I frames are coded intraframe. P frames are coded using motion-compensated prediction from previous I or P frames. B frames are coded using motion-compensated bi-directional prediction/interpolation from adjacent I and P frames.

MPEG-2 aims at digital television (720 pixels by 480 lines) and uses bitrates up to about 10 Mbps with MPEG-1 type motion compensation with I, P, and B frames plus added scalability (a lower bitrate may be extracted to transmit a lower resolution image).

However, the foregoing MPEG compression methods result in a number of unacceptable artifacts, such as blockiness and unnatural object motion, when operated at very low bitrates. These techniques use only the statistical dependencies in the signal at a block level and do not consider the semantic content of the video stream. As a result, artifacts are introduced at the block boundaries under very low bitrates (high quantization factors). Usually these block boundaries do not correspond to physical boundaries of the moving objects, and hence visually annoying artifacts result. Unnatural motion arises when the limited bandwidth forces the frame rate to fall below that required for smooth motion.

MPEG-4 is to apply to transmission bitrates of 10 Kbps to 1 Mbps and is to use a content-based coding approach. Its functionalities include scalability, content-based manipulations, robustness in error prone environments, multimedia data access tools, improved coding efficiency, ability to encode both graphics and video, and improved random access.

A video coding scheme is considered content scalable if the number and/or quality of simultaneous objects coded can be varied. Object scalability refers to controlling the number of simultaneous objects coded. Quality scalability refers to controlling the spatial and/or temporal resolutions of the coded objects. Scalability is an important feature for video coding methods operating across transmission channels of limited bandwidth and also channels where the bandwidth is dynamic. For example, a content-scalable video coder can optimize performance in the face of limited bandwidth by encoding and transmitting only the important objects in the scene at a high quality. It can then choose to either drop the remaining objects or code them at a much lower quality. When the bandwidth of the channel increases, the coder can transmit additional bits to improve the quality of the poorly coded objects or restore the missing objects.

For encoding an I frame, Shapiro, Embedded Image Coding Using Zerotrees of Wavelet Coefficients, 41 IEEE Tr. Sig. Proc. 3445 (1993) provides a wavelet hierarchical subband decomposition which groups wavelet coefficients at different scales and predicts zero coefficients across scales. This provides a fully embedded bitstream in the sense that the bitstream of a lower bitrate is embedded in the bitstream of higher bitrates.

Villasenor et al, Wavelet Filter Evaluation for Image Compression, 4 IEEE Tr. Image Proc. 1053 (1995) discusses the wavelet subband decomposition with various mother wavelets.

However, more efficient coding at low bitrates 
remains a problem. 

Hardware and software implementations of the H.261, MPEG-1, and MPEG-2 compression and decompression exist. Further, programmable microprocessors or digital signal processors, such as the Ultrasparc or TMS320C80, running appropriate software can handle most compression and decompression, and less powerful processors may handle lower bitrate compression and decompression.

SUMMARY OF THE INVENTION 

The present invention provides video compression and decompression with hierarchical subband (including wavelet) coding using zerotrees. Unlike prior zerotree coding, it begins by partitioning the subbands into the baseband plus three sets of higher bands according to hierarchy. The baseband is coded separately (such as by DPCM), and each set of higher bands is then zerotree coded with its own initial threshold.

The present invention also provides video systems with applications for this coding, such as video telephony and fixed camera surveillance for security, including time-lapse surveillance, with digital storage in random access memories.

Advantages include more efficient low bitrate video coding than embedded zerotree wavelet coding while retaining the fully embedded feature. This permits low bitrate teleconferencing and also surveillance information storage.

BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention will now be further described 
by way of example, with reference to the accompanying 
drawings in which: 

Figure 1 is a flow diagram for a preferred embodiment encoding method in accordance with the present invention;

Figures 2a-c illustrate subband hierarchical decomposition;

Figures 3a-c show empirical results;

Figure 4 shows a preferred embodiment telephony system implementing the present invention;

Figure 5 illustrates a preferred embodiment surveillance system implementing the present invention; and

Figure 6 is a flow diagram for a preferred embodiment video compression technique in accordance with the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Preferred embodiment single frame zerotree coding 

Figure 1 is a flow diagram of a first preferred embodiment frame encoding using wavelet hierarchical decomposition together with PCM in the baseband plus zerotrees in the higher bands. For simplicity, the flow diagram will be explained with the help of an example. Presume a 144 by 176 frame of 8-bit pixels (-128 to +127) and four scale levels in a wavelet hierarchical decomposition. The value of the pixel at (j,k) may be denoted x(j,k) for 0 ≤ j ≤ 143 and 0 ≤ k ≤ 175.

To begin the decomposition, first filter the 144 by 176 frame with each of the four filters h0(j)h0(k), h0(j)h1(k), h1(j)h0(k), and h1(j)h1(k) to give four 144 by 176 filtered frames. Boundary pixel values are used to extend the frame for computations which otherwise would extend beyond the frame.

A computationally simple h0(k) function equals 1/√2 at k = 0, 1 and is zero for all other k. The matching h1(k) equals:
1/√2 at k = 0;
-1/√2 at k = 1;
1/(8√2) at k = 2, 3;
-1/(8√2) at k = -1, -2;
and zero for all other k.
The Villasenor article cited in the Background lists other filter functions.

The filtering is mathematically a convolution with these functions. Thus h0 is a lowpass filter in one dimension (it averages over two adjacent pixels), and h1 is a highpass filter in one dimension (essentially a difference of adjacent pixels). The four filters are therefore two-dimensional lowpass-lowpass, lowpass-highpass, highpass-lowpass, and highpass-highpass, respectively.

Next, subsample each filtered frame by a factor of four by retaining only pixels at (j,k) with j and k both even integers. This subsampling yields four 72 by 88 pixel images, denoted LL1, LH1, HL1, and HH1, respectively, with pixel locations (j,k) relabelled for 0 ≤ j ≤ 71 and 0 ≤ k ≤ 87. This forms the first level of the decomposition. The four images can be placed together to form a single 144 by 176 image, which makes visualization of the decomposition simple, as illustrated in Figure 2a. Thus LL1 is a lower resolution version of the original frame and could be used as a compressed version of the original frame.
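The Python fragment below is a minimal sketch of this first decomposition level. It applies the four separable filters h0(j)h0(k), h0(j)h1(k), h1(j)h0(k), and h1(j)h1(k) with the taps given above, then keeps only the even (j, k) outputs.

- It assumes that extending the frame with boundary pixel values means edge replication.
- It favors clarity over speed.
- The helper names are illustrative, not part of the described method.

```python
import math

S = 1 / math.sqrt(2)
# Filter taps from the text, stored as {index: coefficient}.
H0 = {0: S, 1: S}  # lowpass: averages two adjacent pixels
H1 = {0: S, 1: -S, 2: S / 8, 3: S / 8, -1: -S / 8, -2: -S / 8}  # highpass


def pixel(x, j, k):
    """x(j, k), replicating boundary pixels outside the frame (assumed extension)."""
    j = min(max(j, 0), len(x) - 1)
    k = min(max(k, 0), len(x[0]) - 1)
    return x[j][k]


def filter_and_subsample(x, hr, hc):
    """Convolve with hr(j)hc(k), then keep only outputs at even (j, k)."""
    return [[sum(a * b * pixel(x, j - m, k - n)
                 for m, a in hr.items() for n, b in hc.items())
             for k in range(0, len(x[0]), 2)]
            for j in range(0, len(x), 2)]


def decompose_one_level(x):
    """Return (LL, LH, HL, HH), each half the size of x in each dimension."""
    return (filter_and_subsample(x, H0, H0), filter_and_subsample(x, H0, H1),
            filter_and_subsample(x, H1, H0), filter_and_subsample(x, H1, H1))
```

Consider a constant frame of value c. LL1 equals 2c everywhere, and LH1, HL1, and HH1 vanish, because the h0 taps sum to √2 and the h1 taps sum to zero. Calling decompose_one_level again on LL1 gives the second level, and so on.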

The LL1, LH1, HL1, and HH1 images can be used to reconstruct the original frame in three steps:
first interpolate each image by a factor of four (to restore the 144 by 176 size);
then filter the four 144 by 176 images with the filters g0(j)g0(k), g0(j)g1(k), g1(j)g0(k), and g1(j)g1(k), respectively;
lastly, add these four filtered images together pixelwise.
The functions g0 and g1 are lowpass and highpass filters, respectively, and relate to h0 and h1 by g0(n) = (-1)^n h1(n) and g1(n) = (-1)^n h0(n). The h0, h1, g0, and g1 functions are symmetric (or antisymmetric) about 1/2, rather than about 0 as would be the case for an odd tap filter. Each filtering therefore introduces a 1/2 pixel shift, so after reconstruction the pixel index is shifted by 1 to compensate for the two shifts.

The second level in the decomposition simply repeats the four filterings with the h0 and h1 functions plus subsampling by a factor of four, but using the LL1 image as the input. Thus the four filtered images are each 36 by 44 and are denoted LL2, LH2, HL2, and HH2. As before, LL2, LH2, HL2, and HH2 can be arranged to visualize the decomposition of LL1, and they could also be used for reconstruction of LL1 with the g0 and g1 based filters. The LH1, HL1, and HH1 images remain unfiltered. Repeat this decomposition on LL2 by filtering with the four filters based on h0 and h1, followed by subsampling, to obtain LL3, LH3, HL3, and HH3, which are 18 by 22 pixel images. Again, LL3, LH3, HL3, and HH3 can be arranged to visualize the decomposition of LL2.

Complete the hierarchical four level decomposition of the original frame with a last filtering of LL3 by the four filters based on h0 and h1, followed by subsampling, to obtain LL4, LH4, HL4, and HH4, which are 9 by 11 pixel images. Figure 2c illustrates all of the resulting images arranged to form an overall 144 by 176 pixel layout. Figure 2c also indicates the tree relation of pixels in the various levels of the decomposition. Indeed, a pixel y(j,k) in LH4 is the result of filtering and subsampling of pixels x(j,k) in LL3:

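$$
\begin{aligned}
y(j,k) ={}& h_0(0)h_1(0)\,x(2j,2k) + h_0(0)h_1(1)\,x(2j,2k-1) + h_0(0)h_1(-1)\,x(2j,2k+1) \\
&+ h_0(0)h_1(2)\,x(2j,2k-2) + h_0(0)h_1(-2)\,x(2j,2k+2) + h_0(0)h_1(3)\,x(2j,2k-3) \\
&+ h_0(1)h_1(0)\,x(2j-1,2k) + h_0(1)h_1(1)\,x(2j-1,2k-1) + h_0(1)h_1(-1)\,x(2j-1,2k+1) \\
&+ h_0(1)h_1(2)\,x(2j-1,2k-2) + h_0(1)h_1(-2)\,x(2j-1,2k+2) + h_0(1)h_1(3)\,x(2j-1,2k-3) \\
={}& \sum_{m=0}^{1}\sum_{n=-2}^{3} h_0(m)\,h_1(n)\,x(2j-m,\,2k-n)
\end{aligned}
$$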

The filtering plus subsampling basically computes y(j,k) from a 2 by 2 area in LL3, since the values of h1(k) are small except for k = 0, 1. So four pixels in LL3 determine y(j,k) in LH4: x(2j-1,2k-1), x(2j-1,2k), x(2j,2k-1), and x(2j,2k). These four pixels in LL3 are related to the four pixels in LH3 at the same positions ((2j-1,2k-1), (2j-1,2k), (2j,2k-1), and (2j,2k)), because they were all computed from essentially the same 16 locations in LL2. Thus the pixel y(j,k) in LH4 is called the parent of the related pixels z(2j-1,2k-1), z(2j-1,2k), z(2j,2k-1), and z(2j,2k) in LH3. Each of these four pixels in LH3 is a child of the parent pixel in LH4.
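The parent-child relation can be sketched as an index mapping. The fragment below uses an assumed 0-based convention: the children of (j, k) are (2j, 2k), (2j, 2k+1), (2j+1, 2k), and (2j+1, 2k+1). This differs from the (2j-1, 2k-1) labelling above only by the one-pixel offset of the half-sample-symmetric filters. Band sizes are those of the 144 by 176 example, and the function names are illustrative.

```python
# Band sizes (rows, cols) for the 144 by 176 example; level 1 is the finest.
SIZES = {1: (72, 88), 2: (36, 44), 3: (18, 22), 4: (9, 11)}


def children(level, j, k):
    """Children of pixel (j, k) at `level`: same orientation, one level finer."""
    if level == 1:
        return []  # the finest level has no children
    return [(level - 1, 2 * j + dj, 2 * k + dk) for dj in (0, 1) for dk in (0, 1)]


def descendants(level, j, k):
    """All descendants of (j, k): children, grandchildren, and so on down to level 1."""
    result = []
    stack = children(level, j, k)
    while stack:
        node = stack.pop()
        result.append(node)
        stack.extend(children(*node))
    return result
```

A pixel in LH4 therefore has 4 + 16 + 64 = 84 descendants in LH3, LH2, and LH1. This is the set that a single zerotree root symbol stands for.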

After the decomposition of the original frame into LL4, LH4, ..., HH1, first encode the 9 by 11 LL4 with PCM (pulse code modulation). PCM simply quantizes each of the 99 pixel values and ignores any spatial correlations. Basically, each pixel in LL4 is the dc component (average) of the corresponding 16 by 16 macroblock in the original frame, and thus LL4 is a low resolution version of the original frame. This encoding of LL4 takes 99N bits, where N bits are used to encode each pixel.
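A PCM coder for LL4 can be as simple as a uniform quantizer applied to each value independently. The sketch below assumes a uniform step size and clamps codes to the signed N-bit range; neither detail is specified above.

With the h0 taps given earlier, each decomposition level doubles the dc level. An LL4 pixel is therefore 16 times the average of its 16 by 16 macroblock, and 8-bit input gives LL4 values in [-2048, 2032]. N = 8 with a step of 16 covers that range.

```python
def pcm_encode(values, n_bits, step):
    """Quantize each value independently to a signed n_bits integer code."""
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    codes = [min(max(round(v / step), lo), hi) for v in values]
    return codes, n_bits * len(values)  # total cost: 99 * N bits for LL4


def pcm_decode(codes, step):
    """Reconstruct each value as its code times the step size."""
    return [c * step for c in codes]
```

For the 9 by 11 example, pcm_encode returns 99 codes and a cost of 99N bits. For values inside the covered range, the reconstruction error is at most half a step.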

Next, use zerotree coding for each of the three highpass channels. In particular, first find the maximum of the magnitudes of the pixels in LH4, LH3, LH2, and LH1. Then set an initial quantization threshold, T_LH, equal to one half of this maximum magnitude. For the example with 8-bit pixels, T_LH may be something like 100.

Then encode LH4 by placing each of its 99 pixels into one of the following classes:
(i) POS (positive significant) if the pixel value is positive and greater than T_LH;
(ii) NEG (negative significant) if the pixel value is negative and its magnitude is greater than T_LH;
(iii) ZTR (zerotree root) if the pixel magnitude is not greater than T_LH and all descendant pixels also have magnitudes not greater than T_LH. The descendants are the children pixels in LH3, the LH2 children of these LH3 children pixels, and the LH1 children pixels of these LH2 children pixels;
(iv) IZ (isolated zero) if the pixel has magnitude not greater than T_LH but at least one of its descendant pixels has magnitude greater than T_LH.
The 99 pixels in LH4 are raster scanned, so the encoding may take 198 bits, with each pixel taking two bits.
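The four-way classification can be written directly from its definition.

- `descendant_values` is assumed to hold the values of all descendants, for instance gathered with a parent-child index mapping.
- The "Z" case covers the finest-level pixels, which have no descendants.
- The 2-bit symbol assignment is hypothetical, since the text fixes only that each symbol takes two bits.

```python
def classify(value, descendant_values, threshold, finest_level=False):
    """Dominant-pass symbol for one coefficient, following classes (i)-(iv)."""
    if abs(value) > threshold:
        return "POS" if value > 0 else "NEG"
    if finest_level:
        return "Z"  # LH1/HL1/HH1 pixels have no descendants: plain zero
    if all(abs(d) <= threshold for d in descendant_values):
        return "ZTR"  # this pixel and its whole tree are insignificant
    return "IZ"  # insignificant, but some descendant is significant


# Hypothetical 2-bit code assignment for the four symbols.
SYMBOL_BITS = {"POS": "00", "NEG": "01", "ZTR": "10", "IZ": "11"}
```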

Then do the same steps of finding a maximum pixel magnitude and threshold T_HL for the pixels in HL4, HL3, HL2, and HL1. Encode them with the same classes POS, NEG, ZTR, and IZ, but using T_HL as the quantization threshold. Then do the same for HH4, HH3, HH2, and HH1 with a threshold T_HH.

Transmitting these encodings of LH4, HL4, and HH4 in addition to the prior encoding of LL4 increases the resolution of the final reconstructed frame. The decoder uses a value of ±1.5 T_LH for pixels coded POS or NEG and a value of 0 for pixels coded ZTR or IZ in LH4, and similarly uses ±1.5 T_HL and ±1.5 T_HH for HL4 and HH4.

Continue the encoding for LH3, HL3, HH3, LH2, HL2, HH2, LH1, HL1, and HH1 using the corresponding T_LH, T_HL, or T_HH. By the definition of zerotree root, the descendant pixels of a zerotree root pixel need not be coded. These pixels may therefore be skipped in the scanning, and the decoder receiving the bitstream will fill in zeros. In LH1, HL1, and HH1 there are no descendant pixels, so a simple zero is used instead of zerotree root and isolated zero. This encoding is essentially a map of the location (and sign) of significant pixels (those greater than the threshold).

Then scan through the significant pixels (those encoded as either POS or NEG). Encode each with an additional bit that distinguishes whether its magnitude is in the range [T_XX, 1.5T_XX] or [1.5T_XX, 2T_XX], where T_XX denotes the appropriate threshold (T_LH, T_HL, or T_HH).
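Below is a sketch of this first refinement bit. The decoder reconstructs at the midpoint of the selected half-range. The sketch assumes that a magnitude exactly equal to 1.5T_XX is sent as the upper half; the text does not settle that tie.

```python
def first_refinement_bit(value, threshold):
    """1 if |value| lies in the upper half [1.5T, 2T] of the initial range, else 0."""
    return 1 if abs(value) >= 1.5 * threshold else 0


def reconstruct_after_refinement(sign, bit, threshold):
    """Midpoint of the selected half-range (1.25T or 1.75T), signed by POS/NEG."""
    magnitude = (1.75 if bit else 1.25) * threshold
    return sign * magnitude
```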

Replace all significant pixels with zeros in LH4, HL4, ..., HL1, HH1. The significant pixels were identified in the foregoing encoding, and their values are listed for subsequent finer quantizations and encodings. Also replace T_LH with T_LH/2, T_HL with T_HL/2, and T_HH with T_HH/2. Then repeat the encoding as POS, NEG, ZTR, or IZ with the new thresholds for LH4, HL4, ..., HL1, HH1, which were just modified by zeros replacing the previously significant pixels. This essentially refines the quantization and defines new significant pixels.

Again scan LH4, HL4, ..., HL1, HH1, skipping the previously significant pixels which were replaced by zero, and transmit the encodings POS, NEG, ZTR, and IZ. This represents a further resolution increase for a frame reconstructed from the code generated so far, using the midpoints of the quantization ranges as the values of significant pixels.

Again, repeat the scan through the new list of significant pixels and encode an additional bit for each. The bit distinguishes between magnitudes in the upper halves of the appropriate quantization ranges, [0.75T_XX, T_XX], [1.25T_XX, 1.5T_XX], and [1.75T_XX, 2T_XX], and the lower halves of those ranges, [0.5T_XX, 0.75T_XX], [T_XX, 1.25T_XX], and [1.5T_XX, 1.75T_XX].

Similarly, again replace significant pixels with zeros and replace the thresholds with their halves: T_LH/2 with T_LH/4, T_HL/2 with T_HL/4, and T_HH/2 with T_HH/4. Then repeat the encoding as POS, NEG, ZTR, or IZ with the new thresholds for LH4, HL4, ..., HL1, HH1, which were just modified by zeros replacing the previously significant pixels.

These successive decreases in the quantization thresholds provide increasingly higher resolution reconstructions of the original frame.

If the three initial thresholds T_LH, T_HL, and T_HH are greatly different, the bitstream of the images with the smaller threshold could be deferred until the larger threshold has been reduced to a comparable size during the iteration. For example, if T_HL is twice T_LH, then every pixel magnitude in the LHs is at most 2T_LH = T_HL. So all LH pixels are insignificant compared to at least one pixel in the HLs. At the resolution of the largest pixel in the HLs, all of the pixels in the LHs are zero and are not needed for a reconstruction at this resolution. Conversely, if the three initial thresholds are all somewhat comparable in size, then a single threshold could be used for all three channels for simplicity.

The separate PCM encoding of LL4 often removes the largest magnitude pixels. This allows the encoding of LH4, HL4, ..., HH1 to begin with a smaller initial quantization threshold and to be more efficient than without the separate LL4 encoding. Similarly, determining a separate threshold for each of the three highpass channels allows more precise initial quantization thresholds. The cost is transmitting three initial quantization thresholds in place of a single threshold.
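The per-channel thresholds and the deferral rule described above might be computed as follows. This is a sketch.

- `bands` is an assumed layout that maps each channel name to its list of level images, as nested lists.
- `deferred_passes` counts how many halvings of the largest threshold occur while every coefficient in a channel is still insignificant.
- The count uses the bound that each channel's magnitudes are at most twice its initial threshold.

```python
def initial_thresholds(bands):
    """Half the largest coefficient magnitude in each channel ("LH", "HL", "HH")."""
    return {name: max(abs(v) for image in images for row in image for v in row) / 2
            for name, images in bands.items()}


def deferred_passes(thresholds):
    """Leading passes during which each channel can be skipped entirely."""
    t_max = max(thresholds.values())
    result = {}
    for name, t in thresholds.items():
        passes, current = 0, t_max
        # All magnitudes in this channel are <= 2t, hence insignificant
        # while the leading threshold is still >= 2t. (An all-zero channel,
        # t == 0, is reported as 0 rather than looping forever.)
        while t > 0 and current >= 2 * t:
            passes += 1
            current /= 2
        result[name] = passes
    return result
```

In the example of the preceding paragraph, T_HL = 2T_LH. The LH channel is then deferred for exactly one pass, which is the first pass at threshold T_HL.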

The overall bitstream consists of the following blocks, in order:
a block of bits fully encoding LL4;
a block of bits encoding significant pixel locations using the initial quantization thresholds T_LH, T_HL, and T_HH;
a block of bits adding one bit of accuracy to each of the significant pixels;
a block of bits encoding significant pixel locations using the refined quantization thresholds T_LH/2, T_HL/2, and T_HH/2;
a block of bits adding one bit of accuracy to each of the significant pixels, both those found with the initial thresholds and those found with the refined thresholds;
and so forth until the target quantization refinement or another bitrate constraint is reached.



Fully embedded preferred embodiment

The preceding first preferred embodiment fully codes the highest level LL image (9 by 11 LL4 in the example) before encoding the LH, HL, or HH images with separate initial quantization thresholds. In contrast, the second preferred embodiment applies successive refining quantization coding to the highest level LL image, analogous to the refining quantization thresholds in the zerotree coding of the LH, HL, and HH images. In the example, the first 99-bit block transmitted would be the most significant bits of the PCM codes of the LL4 pixels. Next come the first level zerotree coding for LH, HL, and HH, then a 99-bit block for the second most significant bit of the LL4 PCM codes, and so forth. This provides full embedding of lower resolution information as the initial portion of higher resolution information on multiple scales.

DPCM preferred embodiment

The preceding first and second preferred embodiments code the highest level LL image (9 by 11 for LL4 in the example) with PCM and thereby fail to take advantage of spatial correlations. The third preferred embodiment follows the first preferred embodiment but uses DPCM coding of the highest LL level.

In particular, first compute a quantized value w(j,k) for each pixel value with a preset quantization step size. Next, for each pixel compute a difference from adjacent pixels as follows. Begin with the top and left boundary (quantized) pixels w(0,k) and w(j,0) and recursively form the difference pixels z(j,0) = w(j,0) - w(j-1,0) for 1 ≤ j < 9 and z(0,k) = w(0,k) - w(0,k-1) for 1 ≤ k < 11. With w(0,0) and the differences z(j,0) and z(0,k), the w(j,0) and w(0,k) can be reconstructed. The magnitudes of the z(j,0) and z(0,k), however, typically are much smaller and require fewer bits to encode.

Next, recursively form all other difference pixels by prediction from the smaller derivative. Set z(j,k) = w(j,k) - w(j-1,k) if |w(j,k-1) - w(j-1,k-1)| < |w(j-1,k) - w(j-1,k-1)|, and set z(j,k) = w(j,k) - w(j,k-1) otherwise. Again, w(j,k) can be reconstructed from w(0,0) and the z(j,k)s, but the magnitudes of the z(j,k)s should be smaller than those of the w(j,k)s. Of course, w(0,0) may be large and is encoded directly.
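The boundary differences and the smaller-derivative predictor can be sketched as an encoder/decoder pair. The decoder can repeat the predictor choice because the choice depends only on w(j,k-1), w(j-1,k-1), and w(j-1,k), which are already reconstructed in raster order. The absolute values in the comparison follow the "smaller derivative" wording. The function names are illustrative, and z[0][0] simply carries w(0,0).

```python
def predict(w, j, k):
    """Predict w(j, k) from the neighbor along the direction of smaller change."""
    vertical = abs(w[j][k - 1] - w[j - 1][k - 1])    # change down the left column
    horizontal = abs(w[j - 1][k] - w[j - 1][k - 1])  # change along the row above
    return w[j - 1][k] if vertical < horizontal else w[j][k - 1]


def dpcm_encode(w):
    """Difference image z for quantized pixels w; z[0][0] holds w(0,0) itself."""
    rows, cols = len(w), len(w[0])
    z = [[0] * cols for _ in range(rows)]
    z[0][0] = w[0][0]
    for j in range(1, rows):
        z[j][0] = w[j][0] - w[j - 1][0]  # left boundary
    for k in range(1, cols):
        z[0][k] = w[0][k] - w[0][k - 1]  # top boundary
    for j in range(1, rows):
        for k in range(1, cols):
            z[j][k] = w[j][k] - predict(w, j, k)
    return z


def dpcm_decode(z):
    """Rebuild w in raster order, recomputing the same prediction choices."""
    rows, cols = len(z), len(z[0])
    w = [[0] * cols for _ in range(rows)]
    w[0][0] = z[0][0]
    for j in range(1, rows):
        w[j][0] = w[j - 1][0] + z[j][0]
    for k in range(1, cols):
        w[0][k] = w[0][k - 1] + z[0][k]
    for j in range(1, rows):
        for k in range(1, cols):
            w[j][k] = predict(w, j, k) + z[j][k]
    return w
```

On a plane such as w(j,k) = j + 2k, every interior z(j,k) equals 1, while the w(j,k) themselves range up to 28.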

Then encode the z(j,k)s with an adaptive variable length entropy code as follows. Empirically, the z(j,k) fall into two classes:
(A) small values, typically less than 15, due to the effectiveness of the prediction;
(B) relatively large values, due to the variance of the data.
The preferred embodiment divides the code into two sets: one set for small values, which are coded with shorter length codes, and the other set for large values. Within each set the values are generally evenly distributed (high entropy). To simplify implementation, the same number of bits is therefore used to code each symbol in a set.

Next, compute the maximum of the magnitudes of the z(j,k); call this maximum G. The maximum number of bits needed to encode each of the remaining z(j,k) (magnitude plus sign) thus is:

$$\text{bits} = \lceil \log_2(G+1) \rceil + 1$$

where ⌈·⌉ is the ceiling function, whose value is the smallest integer not less than its argument.

Then search for the optimal number of bits, n_opt, which divides the z(j,k) into two sets. One set is coded with n_opt bits and the other with the full "bits" bits. For a total of N z(j,k)s (N = 98 in the example), let b be the number of bits required for each z(j,k) in a set of smallest magnitude z(j,k)s, and let N_b be the number of z(j,k)s in that set. Then the coding gain per symbol over PCM is:

$$\text{gain} = \frac{(\text{bits} - b)\,N_b}{N} - 1$$

The -1 accounts for the one bit per symbol that indicates which set the symbol belongs to. Thus find n_opt by looping through all values of b (b = 0, 1, 2, ..., bits) and taking n_opt to be the value of b that gives the largest gain.
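The n_opt search can be sketched in a few lines. The sketch assumes the cost model behind the gain formula:

- every symbol carries one flag bit selecting its set;
- symbols whose sign-magnitude form fits in b bits use b bits;
- the rest use the full "bits".

Ties are resolved toward the smallest b. The identity ⌈log2(m+1)⌉ = m.bit_length() for integers m ≥ 0 avoids floating-point logarithms.

```python
def bits_needed(value):
    """Sign-magnitude bit count ceil(log2(|v| + 1)) + 1 for an integer value."""
    return abs(value).bit_length() + 1


def choose_n_opt(z_values):
    """Return (n_opt, gain) maximizing the per-symbol gain over PCM."""
    n = len(z_values)
    full = bits_needed(max(abs(v) for v in z_values))  # "bits" in the text

    def gain(b):
        n_b = sum(1 for v in z_values if bits_needed(v) <= b)  # symbols fitting in b bits
        return (full - b) * n_b / n - 1  # -1 pays for the per-symbol set flag

    n_opt = max(range(full + 1), key=gain)  # first maximum, i.e. smallest b on ties
    return n_opt, gain(n_opt)
```

For example, 96 small differences in the range -2 to 3 plus two outliers near 100 need 8 bits under plain PCM. The search picks n_opt = 3, saving nearly 4 bits per symbol.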

Color preferred embodiment

For color images, a frame may be represented by three separate frames: a luminance frame Y, such as 144 by 176, and chrominance frames U and V at half resolution in each dimension, such as 72 by 88. Each of the three frames is then encoded with one of the foregoing preferred embodiments, and the resulting Y, U, and V bitstreams could simply be concatenated. For the DPCM baseband, the stream carries side information: "Qb" (8 bits), the quantization step size for the baseband; the length of the baseband bitstream in bytes; the quantized value of w(0,0); and the number of bits used for the set of small values.

Empirical results

Figures 3a-c show empirical results comparing the preferred embodiment coding with the method of Shapiro cited in the Background. The preferred embodiment shows an improvement over that method.

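Overview of Compression and Decompression

Figure 4 illustrates in block diagram form a preferred embodiment videotelephony (teleconferencing) system. The system transmits images of the speaker using preferred embodiment compression, encoding, decoding, and decompression, including error correction with the encoding and decoding. Of course, Figure 4 only shows transmission in one direction and to only one receiver. In practice, a second camera and second receiver would be used for transmission in the opposite direction, and a third or more receivers and transmitters could be connected into the system. The video and speech are separately compressed, and the allocation of transmission channel bandwidth between video and speech may be dynamically adjusted.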

overly large time delays for very low bitrate transmission. An I picture occurs only once every 5 or 10 seconds, and the majority of frames are P pictures. For frames of 144 rows of 176 pixels, an I picture will be encoded with roughly 20 Kbits and a P picture with roughly 2.5 Kbits, so the overall bitrate will be roughly 22 Kbps (only 10 frames per second or less). The frames may be monochrome or color. Color is given by an intensity frame (Y signal) plus one quarter resolution (subsampled) color combination frames (U and V signals).

(1) Initially, encode the zeroth frame F_0 as an I picture as in MPEG-1,2, using a preferred embodiment based on the wavelet transform:
compute the multilevel decomposition of the frame;
separate the baseband (LLk if k levels are used) and encode it with PCM or DPCM (PCM provides simple full embedding);
for each of the three sets of higher bands (HH1, HH2, ... HHk; HL1, HL2, ... HLk; and LH1, LH2, ... LHk), separately zerotree encode the wavelet coefficients;
transmit in scan line order with the PCM of LLk interleaved for full embedding.
Other frames will also be encoded as I frames, with the proportion of I frames dependent upon the transmission channel bitrate. If F_N is to be an I picture, encode it in the same manner as F_0.

(2) For frame F_N to be a P picture, detect moving objects in the frame by finding the regions of change from reconstructed F_(N-1) to F_N. Reconstructed F_(N-1) is the approximation to F_(N-1) which is actually transmitted, as described below. Note that the regions of change need not be partitioned into moving objects plus uncovered background and will only approximately describe the moving objects. However, this approximation suffices and provides more efficient low bitrate coding. Of course, an alternative would be to also make this partition into moving objects plus uncovered background through mechanisms such as:
inverse motion vectors, to determine whether a region maps to outside of the change region in the previous frame and thus is uncovered background;
edge detection, to determine the object;
presumption of object characteristics (models), to distinguish the object from the background.

(3) For each connected component of the regions of change from step (2), code its boundary contour, including any interior holes. Thus the boundaries of moving objects are not exactly coded; rather, the boundaries of entire regions of change are coded and approximate the boundaries of the moving objects. The boundary coding may be either by splines approximating the boundary or by a binary mask indicating blocks within the region of change. The spline provides a more accurate representation of the boundary, but the binary mask uses a smaller number of bits.

The connected components of the regions of change may be determined by raster scanning the binary image mask. The scan sorts pixels in the mask into groups according to the sorting of adjacent pixels, and these groups may merge. The final groups of pixels are the connected components (connected regions). For an example of a program, see Ballard et al, Computer Vision (Prentice Hall) at pages 149-152. For convenience in the following, the connected components (connected regions) may be referred to as (moving) objects.

(4) Remove temporal redundancies in the video sequence by motion estimation of the objects from the previous frame. In particular, match a 16 by 16 block in an object in the current frame F_N with the 16 by 16 block in the same location in the preceding reconstructed frame F_(N-1), plus translations of this block up to 15 pixels in all directions. The best match defines the motion vector for this block. An approximation F'_N to the current frame F_N can then be synthesized from the preceding frame F_(N-1) by using the motion vectors with their corresponding blocks of the preceding frame.

(5) After the use of motion of objects to synthesize an approximation F'_N, there may still be areas within the frame which contain a significant amount of residual information, such as fast changing areas. Motion segmentation, analogous to steps (2)-(3), is therefore applied to the regions of difference between F_N and the synthesized approximation F'_N. This defines the motion failure regions, which contain significant information.

(6) Encode the motion failure regions from step (5) using a waveform coding technique based on the DCT or wavelet transform.
For the DCT case, tile the regions with 16 by 16 macroblocks, apply the DCT on 8 by 8 blocks of the macroblocks, then quantize and encode (runlength and then Huffman coding).
For the wavelet case, set all pixel values outside the regions to a constant (e.g., zero) and apply the multilevel decomposition. Then quantize and encode (zerotree and then arithmetic coding) only those wavelet coefficients corresponding to the selected regions.

(7) Assemble the encoded information for I pictures (DCT or wavelet data) and P pictures (objects ordered with each object having contour, motion vectors, and motion failure data). These can be codewords from a table of Huffman codes; this is not a dynamic table but rather one generated experimentally.

(8) Insert resynchronization words at the beginning of each I picture data, each P picture, each contour data, each motion vector data, and each motion failure data. These resynchronization words are unique in that they do not appear in the Huffman codeword table and thus can be unambiguously determined.

(9) Encode the resulting bitstream from step (8) with Reed-Solomon codes together with interleaving. Then transmit or store.

(10) Decode a received encoded bitstream by Reed-Solomon decoding plus deinterleaving. The resynchronization words help after decoding failure and also provide access points for random access. Further, the decoding may use shortened Reed-Solomon decoders on either side of the deinterleaver, plus feedback from the second decoder to the first decoder (a stored copy of the decoder input), for enhanced error correction.
(11) Additional functionalities are also supported, which result in a scalable bitstream:
object scalability (selective encoding/decoding of objects in the sequence);
quality scalability (selective enhancement of the quality of the objects).
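Step (4) calls for a full search over displacements of up to 15 pixels. Below is a minimal sketch of that search for one 16 by 16 block.

- The text says only "best match", so the sum of absolute differences (SAD) used here is an assumed criterion.
- Candidates that would leave the reference frame are simply skipped.
- The function names are illustrative.

```python
def sad(cur, ref, top, left, dy, dx, size=16):
    """Sum of absolute differences between a block of cur and a displaced block of ref."""
    return sum(abs(cur[top + i][left + j] - ref[top + dy + i][left + dx + j])
               for i in range(size) for j in range(size))


def motion_vector(cur, ref, top, left, size=16, radius=15):
    """Full search over displacements up to `radius` pixels; returns (dy, dx)."""
    rows, cols = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Skip candidate blocks that would fall outside the reference frame.
            if not (0 <= top + dy and top + dy + size <= rows
                    and 0 <= left + dx and left + dx + size <= cols):
                continue
            cost = sad(cur, ref, top, left, dy, dx, size)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best[1], best[2]
```

Full search costs (2 × 15 + 1)² = 961 block comparisons per 16 by 16 block. This is the "more computation" that faster, non-exhaustive search patterns are usually introduced to reduce.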

The preferred embodiments may be varied in many ways while retaining one or more of their features of separate encoding of the highest LL band and zerotree encoding of the LHk, ..., LH1, HLk, ..., HL1, HHk, ..., HH1 bands. For example, the size of frames, decomposition levels, thresholds, quantization levels, symbols, and so forth can be changed. Generally, subband filtering of other types, such as QMF and Johnston, could be used in place of the wavelet filtering, provided that the region of interest based approach is maintained. Images with one, three, or more dimensions can analogously be encoded by decomposition and separate encoding of the highest level lowpass filtered image.

Claims 

1. A method of encoding an image, comprising the steps of:

decomposing an image into k levels of subbands by lowpass and highpass filtering;
encoding the lowest subband;
encoding the subbands other than said lowest subband with zerotree encoding.